Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve create index #868

Closed
wants to merge 16 commits into from
Closed

Improve create index #868

wants to merge 16 commits into from

Conversation

gaurav274
Copy link
Member

👋 Thanks for submitting a Pull Request to EvaDB!

🙌 We want to make contributing to EvaDB as easy and transparent as possible. Here are a few tips to get you started:

  • 🔍 Search existing EvaDB PRs to see if a similar PR already exists.
  • 🔗 Link this PR to a EvaDB issue to help us understand what bug fix or feature is being implemented.
  • 📈 Provide before and after profiling results to help us quantify the improvement your PR provides (if applicable).

👉 Please see our ✅ Contributing Guide for more details.

# logic: We assume that the maximum number of files in the table is <=
# MAGIC_NUMBER and the number of frames or chunks for each video/document is <=
# MAGIC_NUMBER. Based on this assumption, we can safely say that
# `_row_id` * MAGIC_NUMBER + `chunk_id` for document table and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you determine the magic number? And when will be the chunking be done?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, I set it to an arbitrarily large value. The idea is that as long as the number of files in the table is fewer than this number, our assumption will hold.

Reg chunking: It is done when we read the document similar to frame decoding in videos.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. And for video, each video will get a unique ID and each frame will be assigned a different frame ID? Is this assumption correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have these IDs (_row_id for video and id for frame). Building on top of it. The assumption is these ids won't change across runs. _row_id is persisted, so no issue there. id is generated at runtime, and as long as the reader is deterministic across runs, we don't have a problem.

@@ -34,10 +34,12 @@ def _read(self) -> Iterator[Dict]:
doc = fitz.open(self.file_url)

# PAGE ID, PARAGRAPH ID, STRING
# Maintain a global paragraph number per PDF
global_paragraph_no = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this so that the calculated id for the create index method stays unique. Otherwise, it is still not unique because multiple paragraphs at different pages can have the same paragraph id. @gaurav274

jiashenC added a commit that referenced this pull request Sep 8, 2023
Use a runtime `row_number` to build the index by incorporating design
discussions from #912 and
#868.
@jiashenC
Copy link
Member

jiashenC commented Sep 8, 2023

Close this for now. #1073 is merged as a fix.

@jiashenC jiashenC closed this Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants